keyboard and mouse


JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse

Li, Muyao, Wang, Zihao, He, Kaichen, Ma, Xiaojian, Liang, Yitao

arXiv.org Artificial Intelligence

Recently, action-based decision-making in open-world environments has gained significant attention. Visual Language Action (VLA) models, pretrained on large-scale web datasets, have shown promise in decision-making tasks. However, previous work has primarily focused on action post-training, often neglecting enhancements to the foundational model itself. In response, we introduce a novel approach, Act from Visual Language Post-Training, which refines Visual Language Models (VLMs) through visual and linguistic guidance in a self-supervised manner. This enhancement improves the models' capabilities in world knowledge, visual recognition, and spatial grounding in open-world environments. Following the above post-training paradigms, we obtain the first VLA models in Minecraft that can follow human instructions on over 1k different atomic tasks, including crafting, smelting, cooking, mining, and killing. Our experiments demonstrate that post-training on non-trajectory tasks leads to a significant 40% improvement over the best agent baseline on a diverse set of atomic tasks. Furthermore, we demonstrate that our approach surpasses traditional imitation learning-based policies in Minecraft, achieving state-of-the-art performance. We have open-sourced the code, models, and datasets to foster further research. The project page can be found at https://craftjarvis.github.io/JarvisVLA.
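For illustration only, the sketch below shows how an instruction-following agent's text output might be decoded into keyboard-and-mouse events; the key:/mouse:/click: token format is an assumption made for this example, not JARVIS-VLA's actual action encoding.

```python
# Hypothetical action decoder: the token format ("key:w", "mouse:12,-3",
# "click:left") is invented for illustration and is NOT JARVIS-VLA's scheme.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str        # "key", "mouse_move", or "click"
    payload: tuple

def decode_action(token: str) -> Action:
    """Parse one model-emitted action token into a structured action."""
    if token.startswith("key:"):                    # e.g. "key:w" -> press the w key
        return Action("key", (token[4:],))
    if token.startswith("mouse:"):                  # e.g. "mouse:12,-3" -> move cursor by (dx, dy)
        dx, dy = map(int, token[6:].split(","))
        return Action("mouse_move", (dx, dy))
    if token.startswith("click:"):                  # e.g. "click:left" -> mouse button
        return Action("click", (token[6:],))
    raise ValueError(f"unrecognized action token: {token}")

# Example response for an atomic task like "chop the tree in front of you"
for tok in ["mouse:15,0", "key:w", "click:left"]:
    print(decode_action(tok))
```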


Cradle: Empowering Foundation Agents Towards General Computer Control

Tan, Weihao, Zhang, Wentao, Xu, Xinrun, Xia, Haochong, Ding, Ziluo, Li, Boyu, Zhou, Bohan, Yue, Junpeng, Jiang, Jiechuan, Li, Yewen, An, Ruyi, Qin, Molei, Zong, Chuqiao, Zheng, Longtao, Wu, Yujie, Chai, Xiaoqiang, Bi, Yifei, Xie, Tianbao, Gu, Pengjie, Li, Xiyun, Zhang, Ceyao, Tian, Long, Wang, Chaojie, Wang, Xinrun, Karlsson, Börje F., An, Bo, Yan, Shuicheng, Lu, Zongqing

arXiv.org Artificial Intelligence

Despite the success in specific scenarios, existing foundation agents still struggle to generalize across various virtual scenarios, mainly due to the dramatically different encapsulations of environments with manually designed observation and action spaces. To handle this issue, we propose the General Computer Control (GCC) setting to restrict foundation agents to interact with software through the most unified and standardized interface, i.e., using screenshots as input and keyboard and mouse actions as output. We introduce Cradle, a modular and flexible LMM-powered framework, as a preliminary attempt towards GCC. Enhanced by six key modules, Cradle can understand input screenshots and output executable code for low-level keyboard and mouse control after high-level planning, so that Cradle can interact with any software and complete long-horizon complex tasks without relying on any built-in APIs. Experimental results show that Cradle exhibits remarkable generalizability and impressive performance across four previously unexplored commercial video games, five software applications, and a comprehensive benchmark, OSWorld. Cradle is the first to enable foundation agents to follow the main storyline and complete 40-minute-long real missions in the complex AAA game Red Dead Redemption 2 (RDR2). Cradle can also create a city of a thousand people in Cities: Skylines, farm and harvest parsnips in Stardew Valley, and trade and bargain with a maximal weekly total profit of 87% in Dealer's Life 2. Cradle can not only operate daily software, like Chrome, Outlook, and Feishu, but also edit images and videos using Meitu and CapCut. Cradle greatly extends the reach of foundation agents by enabling the easy conversion of any software, especially complex games, into benchmarks to evaluate agents' various abilities and facilitate further data collection, thus paving the way for generalist agents.
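As a rough sketch of the GCC loop described above (screenshots in, keyboard and mouse actions out), the snippet below uses pyautogui for screen capture and input; lmm_plan() is a stub standing in for Cradle's multimodal planning modules, which are not reproduced here.

```python
# Minimal GCC-style step, assuming pyautogui is installed and a display is
# available. lmm_plan() is a placeholder, not Cradle's real planner.
import pyautogui

def lmm_plan(screenshot) -> str:
    """Placeholder for a multimodal-model call that returns executable
    keyboard/mouse code as a string."""
    return "pyautogui.press('esc')"              # e.g. dismiss a menu

def gcc_step():
    screenshot = pyautogui.screenshot()          # unified observation: pixels only
    action_code = lmm_plan(screenshot)           # high-level planning -> low-level code
    exec(action_code, {"pyautogui": pyautogui})  # unified action: keyboard and mouse

if __name__ == "__main__":
    gcc_step()
```

Executing model-generated code, as in the exec() call above, mirrors the abstract's point that Cradle outputs executable code for low-level control rather than relying on any built-in game or application API.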


Combinatorial Reasoning: Selecting Reasons in Generative AI Pipelines via Combinatorial Optimization

Esencan, Mert, Kumar, Tarun Advaith, Asanjan, Ata Akbari, Lott, P. Aaron, Mohseni, Masoud, Unlu, Can, Venturelli, Davide, Ho, Alan

arXiv.org Artificial Intelligence

Recent Large Language Models (LLMs) have demonstrated impressive capabilities at tasks that require human intelligence and are a significant step towards human-like artificial intelligence (AI). Yet the performance of LLMs at reasoning tasks has been subpar, and the reasoning capability of LLMs is a matter of significant debate. While it has been shown that the choice of prompting technique can alter an LLM's performance on a multitude of tasks, including reasoning, the best-performing techniques require human-crafted prompts written with knowledge of the task at hand. We introduce a framework for what we call Combinatorial Reasoning (CR), a fully automated prompting method in which reasons are sampled from an LLM pipeline and mapped into a Quadratic Unconstrained Binary Optimization (QUBO) problem. The framework investigates whether QUBO solutions can be profitably used to select a useful subset of the reasons to construct a Chain-of-Thought style prompt. We explore the acceleration of CR with specialized solvers. We also investigate the performance of simpler zero-shot strategies such as linear majority rule or random selection of reasons. Our preliminary study indicates that coupling a combinatorial solver to generative AI pipelines is an interesting avenue for AI reasoning and elucidates design principles for future CR methods.
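To make the reason-selection step concrete, here is a toy example with invented scores: sampled reasons are encoded as a QUBO (minimize x^T Q x over binary x) whose diagonal rewards frequently sampled reasons and whose off-diagonal terms penalize redundancy, and the best subset is found by brute force. The actual CR pipeline's QUBO construction and solvers may differ.

```python
# Toy QUBO-based reason selection; all scores and penalties are invented.
import itertools
import numpy as np

reasons = [
    "The object is denser than water, so it sinks.",   # relevant
    "A density above 1 g/cm^3 implies sinking.",       # relevant, overlaps with the first
    "The object is painted red.",                      # irrelevant, rarely sampled
]
frequency = np.array([0.9, 0.8, 0.2])   # how often each reason was sampled (assumed)
selection_cost = 0.5                    # flat penalty for including any reason (assumed)
overlap = np.zeros((3, 3))
overlap[0, 1] = overlap[1, 0] = 0.1     # mild redundancy penalty (assumed)

# QUBO: minimize x^T Q x over x in {0,1}^n.
Q = overlap.copy()
np.fill_diagonal(Q, selection_cost - frequency)

best_x, best_val = None, float("inf")
for bits in itertools.product([0, 1], repeat=len(reasons)):  # brute force; CR uses specialized solvers
    x = np.array(bits)
    val = float(x @ Q @ x)
    if val < best_val:
        best_x, best_val = x, val

selected = [r for r, keep in zip(reasons, best_x) if keep]
print("Reasons kept for the Chain-of-Thought prompt:", selected)  # drops the irrelevant reason
```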


How to Play Your Favorite Google Play Mobile Games on PC (2023)

WIRED

Hopefully, Google will add more titles soon. Any developers interested in making their Android games compatible can get started at the official Android developer's website. The Google Play Games for PC service works with Google Play Points, so you can earn points for purchases (including subscriptions and in-app purchases) just as you would on an Android device. Any points you accumulate can be redeemed for vouchers and special game offers in the Play Store. Once you start a game, you can press Shift and Tab to access the menu, where you can change the screen resolution, tweak the volume, and remap the game controls.


Effective Gesture Based Framework for Capturing User Input

Charan, Pabbathi Sri, Gupta, Saksham, Agrawal, Satvik, Sindhu, Gadupudi Sahithi

arXiv.org Artificial Intelligence

Computers today aren't confined to laptops and desktops; mobile gadgets such as mobile phones also rely on computing. However, one input device that hasn't changed in the last 50 years is the QWERTY keyboard. Thanks to sensor technology and artificial intelligence, users of virtual keyboards can type on any surface as if it were a keyboard. In this research, we use image processing to build a vision-based virtual keyboard application on top of a novel framework that detects hand gestures with high accuracy while remaining sustainable and financially viable. A camera captures images of the keyboard and the user's finger movements, which together act as a virtual keyboard. This study also describes a vision-based virtual mouse that accepts finger coordinates as input. The system directly reduces peripheral cost, cuts the electronic waste generated by external devices, and provides accessibility to people who cannot use a traditional keyboard and mouse.
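As a loose sketch of the virtual-mouse idea, the snippet below maps normalized fingertip coordinates (as a hand detector might emit them) to screen pixels with simple smoothing and moves the cursor via pyautogui; detect_fingertip() is a placeholder for the paper's camera and image-processing pipeline, which is not reproduced here.

```python
# Illustrative fingertip-to-cursor mapping; detect_fingertip() is a stub.
import pyautogui

SCREEN_W, SCREEN_H = pyautogui.size()

def detect_fingertip(frame):
    """Placeholder: a real system would return the index-fingertip position
    from a camera frame, normalized to the range [0, 1] in both axes."""
    return 0.5, 0.5

def to_screen(u, v, prev=None, alpha=0.3):
    """Map normalized fingertip coordinates to screen pixels, with smoothing."""
    x, y = u * SCREEN_W, v * SCREEN_H
    if prev is not None:                     # exponential smoothing reduces jitter
        x = alpha * x + (1 - alpha) * prev[0]
        y = alpha * y + (1 - alpha) * prev[1]
    return x, y

if __name__ == "__main__":
    u, v = detect_fingertip(frame=None)
    x, y = to_screen(u, v)
    pyautogui.moveTo(x, y)                   # drive the cursor like a mouse
```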


Back-to-school shopping: How to buy the right computer for students of any age

USATODAY - Tech Top Stories

A new school season is upon us – cue the rolling eyes, students – and so you might be in the market for a new computer. Whether you're back in the classroom or continuing to learn online, it's simply the most important piece of tech to help you stay on your game. Problem is, how do you decide what kind of computer is for you? Not only are there varying prices, competing operating systems and countless brands to choose from, but the student – or the parent(s) footing the bill – must decide on an ideal form factor (or type of computer), such as a laptop, desktop, 2-in-1 or all-in-one. And you might think you need a degree in computer science just to understand today's specifications ("specs").


Google's DeepMind AI takes on StarCraft II

#artificialintelligence

At BlizzCon earlier this month in Anaheim, California, Blizzard announced an ambitious new project in collaboration with DeepMind, a leading artificial intelligence research company acquired by Google in 2014. After creating the AlphaGo AI that bested the world's top Go player earlier this year, DeepMind's next groundbreaking challenge will be StarCraft II. If DeepMind is able to build an AI that can learn to beat top players such as Byun "ByuN" Hyun Woo at this game's complex blend of real-time strategy, tactics and resource management, it would be a giant step forward in AI research. And with DeepMind's interest in using its research to solve hard problems in areas such as healthcare and energy efficiency on a massive scale, this StarCraft II project could impact the whole world. Soon after AlphaGo's Go victory, there were signs that DeepMind would take on StarCraft next. This was not lost on legendary StarCraft player/commentator and former competitive chess player Dan "Artosis" Stemkoski, for whom StarCraft seemed like the logical next step for AI research after games like chess and Go.


Developing a Language for Spoken Programming

Gordon, Benjamin M. (University of New Mexico)

AAAI Conferences

The dominant paradigm for programming a computer today is text entry via keyboard and mouse, but there are many common situations where this is not ideal. I address this through the creation of a new language that is explicitly intended for spoken programming. In addition, I describe a supporting editor that improves recognition accuracy by making use of type information and scoping to increase recognizer context.
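A toy illustration of the scoping idea above: when the recognizer proposes several candidate transcriptions for a spoken identifier, preferring candidates that name identifiers visible in the current scope resolves ambiguity. The ranking rule and scores below are invented for illustration and are not the editor's actual mechanism.

```python
# Hypothetical re-ranking of recognizer candidates using scope information.
def rank_candidates(candidates, scope_identifiers):
    """Order candidates so that in-scope identifiers beat out-of-scope ones,
    falling back to the recognizer's own confidence score."""
    in_scope = set(scope_identifiers)
    return sorted(candidates, key=lambda c: (c["name"] not in in_scope, -c["score"]))

scope = ["total", "count", "items"]        # identifiers visible at the cursor (assumed)
candidates = [                             # hypothetical recognizer output
    {"name": "counts", "score": 0.52},
    {"name": "count",  "score": 0.48},
]
print(rank_candidates(candidates, scope)[0]["name"])   # -> "count"
```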